Abstract

This  study  investigated  the  relationship  between  workload  characteristics  and 
process  speedup.  There  were  two  goals:  the  first  was  to  determine  the  functional 
relationship  between  workload  characteristics  and  speedup,  and  the  second  was  to 
show how simulation could be used to determine such a relationship. The hypercube
implementation  used  in  this  study  is  a  packet-switched  network  with  predetermined 
routing.  Message  processing  has  precedence,  so  nodes  are  interrupted  during  task 
processing. 

In  this  study,  three  independent  variables  were  controlled:  total  computational 
workload,  number  of  nodes  and  the  message  traffic  load.  The  workload  was  assumed 
to  be  balanced  across  the  nodes.  A  benchmark  program  was  executed  on  an  actual 
hypercube  and  the  results  were  used  to  validate  a  discrete  event  simulation  model  of 
hypercube processing. Using the simulation model, an experiment was designed to control
the  total  computational  load  over  two  levels,  the  number  of  nodes  over  five  levels  and 
the  message  traffic  load  over  four  levels  to  determine  their  individual  and  interactive 
effects  on  process  speedup. 

Regression  analysis  was  used  to  estimate  the  functional  relationship  between 
the three independent variables and process speedup. The results show that a complex
relationship exists between workload characteristics and cube size. As more
nodes are added, the computational time decreases, but at the same time, the communications
overhead increases such that the speedup will eventually begin to decrease.
The point where speedup starts to decline is dependent upon both the
computational and message traffic workload. Finally, this research presented an alternative
methodology for performance analysis which is more flexible than the traditional
methods. Furthermore, this methodology can be extended to study other
architectures.


A PERFORMANCE STUDY OF THE HYPERCUBE ARCHITECTURE


I.  Introduction 


Background 

Parallel processing has become an attractive solution for applications that require
a large amount of computation in a short period of time. Since the computational
requirement of a single problem is distributed among several processors, there
must be some mechanism for communication between processors. The manner in
which a multiprocessor handles communication between processors classifies it as
either a tightly-coupled or loosely-coupled machine. The former communicates via
a common memory, whereas the latter uses a message-transfer system. The communications
required of a process directly affect the time needed to complete the
process.

The speedup that can be achieved by a parallel computer with n identical
processors working concurrently on a single problem is at most n times
faster than a single processor. In practice, the speedup is much less, since
some processors are idle at a given time because of conflicts over memory
accesses or communication paths ... [HW84]

A speedup of n is achievable only if a multiprocessor is operating at peak
performance. During peak performance, the processors are doing only useful work; no
work is redundant and no extra instructions are executed. That is, the parallelization
does not require more instructions than a uniprocessor would require using the same
algorithm. An ideal speedup is impaired by several factors which include:


•  interprocessor communication,

•  processor synchronization,

•  one or more idle processors,

•  wasted effort, and

•  processing required for system control and scheduling [ST87].

Hypercube

The hypercube is a loosely-coupled multiprocessor. Its interconnection network
is a binary n-cube, so it connects N = 2^n nodes where each node is a processor
with its local memory and n is the dimension of the cube. Figure 1 shows the logical
topology of cubes of one, two, three and four dimensions. The vertices represent
nodes and the edges represent point-to-point full-duplex communication paths.
In a cube of dimension n, every node is directly connected to n other nodes that
are called nearest neighbors. Every node is capable of communicating with every
other node, but messages sent to non-nearest neighbors must pass through intermediate
nodes. The performance of certain algorithms implemented on hypercube
architectures has been measured. Felton et al. recorded efficiencies over 90% when
solving the traveling salesman problem [FE85]. Gutzmann found that when he implemented
a sorting algorithm on a hypercube, a larger dimension (two-dimensional)
cube "actually performed worse than ... smaller ones for a given list size. The reason
for this is that the communications overhead ... is usually larger than direct processing
costs ..." [GU87]. This particular algorithm also required processors to drop
out and become idle before the sorting was completed. The first example represents
results that are encouraging while the other is discouraging. The specific concern is
how much faster a process can complete if more processors are added to the job and
what characteristics of the workload affect speedup performance.
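
As a minimal sketch of the topology just described, assume (this labeling is not stated in the text, though it is the usual convention) that the N = 2^n nodes carry n-bit binary addresses and that two nodes are nearest neighbors when their addresses differ in exactly one bit. The hypothetical helpers below list a node's nearest neighbors and count the links a message must traverse between two nodes.

```python
def neighbors(node: int, dim: int) -> list:
    """Nearest neighbors of `node` in a cube of dimension `dim`.

    Assumes binary node addresses: flipping one address bit gives one neighbor.
    """
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(src: int, dst: int) -> int:
    """Number of links between two nodes (count of differing address bits)."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    dim = 3
    print("nodes in cube:", 2 ** dim)
    print("neighbors of node 0:", neighbors(0, dim))
    print("links from node 0 to node 7:", hops(0, 7))
```

Under this assumption, every node in a cube of dimension n has exactly n neighbors, and the most distant pair of nodes is n links apart.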


Table 1. Research Hypotheses

H1:  There is no difference in hypercube speedup explained
by choice of benchmark or a simulation model of a
hypercube architecture for a controlled workload.

H2:  There is no difference in hypercube speedup explained
by total computational workload, number of processors,
message traffic load and their interactions.

H3:  There is no difference in hypercube speedup explained
by the distribution of burst times of individual processors
where the total workload on each node is the
same.


Statement  of  the  Problem 

The  speedup  of  a  process  that  can  be  run  in  parallel  on  a  hypercube  is  affected 
by  both  the  computational  load  placed  on  each  processor  and  the  message  traffic 
load  between  processors.  The  purpose  of  this  thesis  is  to  present  a  functional  model 
for  determining  the  impact  of  these  factors  on  speedup.  This  model  allows  for  the 
description of the structure of a parallelized algorithm and prediction of speedup
over  various  sizes  of  the  hypercube.  The  research  hypotheses  are  listed  in  Table  1. 

Scope  and  Limitations 

The computational load and the message traffic load must be characterized
to determine the effects of these two factors on the speedup of a process running
in parallel. Since unbalancing the load over processors severely degrades speedup
[MO87], the total computational load is assumed to be evenly distributed and able to
run concurrently.


The  speedup  experienced  by  using  multiple  processors  is  determined  by  relative 
execution  time  on  a  single  processor.  Although  a  particular  process  may  run  faster 
on  another  uniprocessor,  the  comparative  measure  must  be  relative  to  a  processor 
of  the  same  type. 
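
The relative measure described above can be sketched as follows. The function names are illustrative, and the efficiency helper uses the standard definition (speedup divided by the number of processors), which the text does not spell out.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup relative to one processor of the same type."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, n_processors: int) -> float:
    """Fraction of the ideal speedup (n_processors) actually achieved."""
    return speedup(t_single, t_parallel) / n_processors


if __name__ == "__main__":
    # Hypothetical times: 80 s on one node, 10 s on a 16-node cube.
    print(speedup(80.0, 10.0))         # 8.0
    print(efficiency(80.0, 10.0, 16))  # 0.5
```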

The goodness of a particular parallelized algorithm is not considered. The
model is a means by which the performance of an algorithm on the hypercube can
be estimated given the computational and message traffic load characteristics. The
total execution time does not account for time needed to download programs or data
to the hypercube. Although this time is significant, it is more a function of the
algorithm and hypercube implementation than of architectural performance. Ni et
al. have dealt with elimination of this bottleneck during algorithm design [NI87a].

Approach 

The  goal  of  this  thesis  is  to  determine  the  effects  of  the  computational  load, 
the  message  traffic  load  and  their  interaction  on  the  speedup  of  a  process  for  several 
dimensions of the hypercube. To meet this goal, an experiment is designed so that
the  variables  can  be  controlled  independently  of  each  other.  Data  is  collected  from 
benchmarking  and  simulation,  and  the  hypotheses  in  Table  1  are  tested  by  statistical 
means.  Finally,  the  results  are  analyzed  and  interpreted.  Figure  2  shows  the  steps 
which  are  summarized  below  and  described  in  detail  in  Chapter  III. 


Step 1: Identify the independent variables.


(a)  Three primary independent variables are total computational workload,
number of processors and message traffic load. The total computational
workload is quantified as the time for a single processor to complete the
process. The workload placed on a single processor does not include any
communication overhead to slow down total execution time. The second
independent variable, the number of processors, is a quantity that can
be controlled directly. The third variable, the message traffic load, is
quantified by the total number of messages generated during the process.

(b)  A secondary independent variable is the processors' burst times between
transmission of messages. The burst times are characterized as being approximately
the same for all bursts or as being completely random. Since
the computational workload is distributed evenly across the processors,
this variable must be quantified as a binary variable; either the burst
times are approximately the same or they are not.

Step  2:  Benchmark  the  effects  of  the  primary  independent  variables  on  an  actual 
hypercube  such  as  Intel’s  Personal  Super  Computer  (iPSC). 

(a)  A matrix multiplication algorithm is a good candidate to implement as
a benchmark program. Using this algorithm, the computational workload
can be varied independently of the message traffic load by iterative
recalculation of matrix elements.

(b)  The  processors’  computational  times  and  message  generation  and  receipt 
times  are  measured  across  hypercubes  of  dimension  1  through  5  for  several 
different  computational  workloads. 

Figure 2. Research Approach (flowchart: identify and quantify the independent variables; benchmark the hypercube; construct and verify the simulation model; validate the simulation model; design the experiment and exercise the simulation model; analyze and present the results)


Step 3: Construct and verify a simulation model using discrete event simulation. A
simulation of a process running in parallel facilitates control of the independent
variables and measurement of the effects due to workload characteristics
without having to design and implement actual workloads for the hypercube.
The model is constructed under the following conditions: the workload is evenly
balanced, each processor executes the same number of bursts, and messages are
generated between bursts. Model construction includes determining specific
times that are used in the model:

(a)  The iPSC's interprocessor communication times are required to model
message passing between nodes. To capture the portion of communications
that is concurrent with processing, it is necessary to decompose the
message transmission time into its components and to estimate a time for
each component.

(b)  The amount of additional processing time in the matrix multiplication
program resulting from checking the receipt of messages is required to
calibrate the simulation model to the matrix multiplication algorithm.

Step  4:  Validate  the  simulation  model  by  comparing  the  results  of  the  model  to  the 
results  of  the  benchmark  and  ensuring  through  statistical  means  that  there  is 
no  difference  between  the  two. 

Step  5:  Design  an  experiment  that  exercises  the  simulation  model  in  which  the  total 
workload,  number  of  processors,  message  traffic  load  and  burst  duration  are  all 
varied  independently  of  one  another  to  determine  their  main  and  interactive 
effects  on  the  speedup  of  a  process  run  in  parallel. 

Step 6: Analyze and present the results. The relationships between the independent
variables are presented by testing the second and third research hypotheses
listed in Table 1. Under the conditions of the model, the relationships can be
used to understand and predict relative speedups as a function of the independent
variables and draw conclusions about the nature of speedup phenomena.
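
The experiment in Step 5 is a full factorial design. As a rough sketch, the abstract gives the number of levels for the three primary factors (two workload levels, five node-count levels, four message traffic levels); the level labels below are placeholders, and the choice of cube dimensions 1 through 5 is an assumption, since the actual values are defined later in the thesis. The secondary burst-time factor is omitted here.

```python
from itertools import product

# Placeholder level labels; only the level counts come from the abstract.
workload_levels = ["low", "high"]
cube_dimensions = [1, 2, 3, 4, 5]  # assumed: 2, 4, 8, 16 and 32 nodes
traffic_levels = ["level-1", "level-2", "level-3", "level-4"]

# Every combination of levels is one treatment of the factorial design.
design = list(product(workload_levels, cube_dimensions, traffic_levels))

if __name__ == "__main__":
    print(len(design), "treatment combinations")
    for workload, dim, traffic in design[:3]:
        print(workload, 2 ** dim, traffic)
```

Varying every factor against every other in this way is what allows both main effects and interactions to be estimated by regression.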

Overview 

Chapter II gives a summary of current knowledge. Although there is an abundance
of literature on multiprocessing, Chapter II includes only literature that is directly
related to this thesis. Chapter III describes the development of the benchmark and
the simulation model and presents the experimental design. Chapter IV discusses
the results of the benchmark and the simulation. Chapter V summarizes the results
and suggests how the results may be used to predict performance of algorithms on
the hypercube.


There has been a considerable amount of research concerning the performance
of multiprocessors. Much of it involves evaluating the performance of a particular
algorithm on a given architecture. The performances measured are a function of the
algorithm design, the architecture and the interaction between the two, that is,
how well the algorithm is mapped to the architecture. Evaluating the performance
of an architecture is difficult because it cannot be divorced from the algorithm used
to evaluate it; after all, the algorithm dictates what kind of workload will be placed
on the hardware.


The performance of a parallelized algorithm is sensitive to the coupling between
processors. Loosely-coupled machines are more efficient when processor interaction
is low, whereas tightly-coupled machines are more tolerant of interaction
[HW84]. LeBlanc conducted an experiment to find the tradeoffs between two configurations
of the BBN Butterfly Parallel Processor. This machine can support a
shared-memory as well as a message-passing capability. LeBlanc's case study concluded
that matching the application and the computational model was more important
than the model itself [LE86].


A constraining factor governing the processor interaction that a loosely-coupled
system can tolerate and still have reasonable efficiency is the performance of the interconnection
network. The saturation point for communications is characteristic of
the interconnection structure [ST87]. The hypercube is an open-ended architecture,
unlike shared-memory and bus architectures which are limited in their expansion.
Also, since it is a directly connected network, processes exhibiting affinity can be
assigned to nearest-neighbor processors to take advantage of communication locality
[SE85]. Nicol and Willard took advantage of this property in their algorithm and
found linear speedups since there was no contention for communication resources.
Communication costs for transmitting data between partitions were independent of
total communications load [NI87b]. Another feature of the hypercube is its adaptability
to other communication topologies for different applications. Wiley showed
that a four-dimensional cube could be used to emulate a two- or three-dimensional
mesh, a ring or a tree structure [WI87].

Performance  Models 

A basic model for total execution time of processes on multiprocessors or distributed
systems has been given by both Indurkhya et al. and Stone. This model
assumes the processors are connected with a bus. If a process consists of M tasks
each requiring R time units to complete and the communication cost is C time
units, then using a two-processor system, the total execution time, $T_e$, is given by

$$T_e = R \max(M - k,\, k) + C (M - k) k \qquad (1)$$

where k is the number of tasks assigned to one of the processors. The optimal
assignment policy of tasks which minimizes this function is to distribute the tasks
evenly between the processors if M/2 < R/C; otherwise all tasks are assigned to one
processor [IN86a] [ST87].
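
The assignment policy can be checked numerically. The sketch below evaluates the two-processor execution time R max(M - k, k) + C (M - k) k for every possible k and returns the minimizing split; the function names and the parameter values in the example are illustrative only.

```python
def exec_time_two(M: int, R: float, C: float, k: int) -> float:
    """Two-processor execution time with k tasks on one processor."""
    return R * max(M - k, k) + C * (M - k) * k


def best_split(M: int, R: float, C: float) -> int:
    """Value of k (0..M) that minimizes the execution time, by brute force."""
    return min(range(M + 1), key=lambda k: exec_time_two(M, R, C, k))


if __name__ == "__main__":
    # M/2 = 5 < R/C = 10: an even split is best.
    print(best_split(10, 10.0, 1.0))
    # M/2 = 5 > R/C = 1: all tasks on one processor is best.
    print(best_split(10, 1.0, 1.0))
```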

This basic model was extended to N processors, where the total execution time
is given by

$$T_e = R \max_i(k_i) + \frac{C}{2} \sum_{i=1}^{N} k_i (M - k_i) = R \max_i(k_i) + \frac{C}{2} \left( M^2 - \sum_{i=1}^{N} k_i^2 \right) \qquad (2)$$

where $k_i$ is the number of tasks assigned to the $i$th processor. The optimal assignment
policy is no different from the one given above [IN86a] [ST87].


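Under the optimal assignment policy, the cost of evenly distributing the tasks
over N processors is

$$T_e = \frac{RM}{N} + \frac{CM^2}{2} - \frac{CM^2}{2N} \qquad (3)$$

and the speedup is given by

$$\frac{RM}{\frac{RM}{N} + \frac{CM^2}{2} - \frac{CM^2}{2N}} = \frac{R/C}{\frac{R}{CN} + \frac{M(N-1)}{2N}} \qquad (4)$$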

This means that for small N and M and large R/C, the speedup depends more on
N, but when N gets large enough, the speedup is proportional to R/(CM) and does
not depend on the number of processors [ST87].
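
The saturation behavior can be illustrated with a short sketch. With the tasks spread evenly, the execution time is RM/N + CM^2/2 - CM^2/(2N), and the speedup is RM divided by that time; as N grows, the speedup approaches 2R/(CM). The function names and sample parameters are illustrative.

```python
def exec_time_even(M: int, R: float, C: float, N: int) -> float:
    """Execution time with M tasks spread evenly over N processors."""
    return R * M / N + C * M ** 2 / 2 - C * M ** 2 / (2 * N)


def speedup_even(M: int, R: float, C: float, N: int) -> float:
    """Speedup over the single-processor time R * M."""
    return (R * M) / exec_time_even(M, R, C, N)


if __name__ == "__main__":
    M, R, C = 64, 100.0, 1.0
    for N in (1, 2, 4, 8, 16, 32):
        print(N, round(speedup_even(M, R, C, N), 3))
    print("limit for large N:", 2 * R / (C * M))
```

For these sample values the speedup cannot exceed 2R/(CM) = 3.125 no matter how many processors are added.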

An assumption made under the previous models is that every task must communicate
with every other remotely assigned task. If the assumption is changed so
that information sent to a processor is distributed to all its resident tasks, then a
linear communications cost model is appropriate. This means that communications
cost is proportional to the number of processors instead of the number of tasks. The
total execution time for this model is given by

$$T_e = R \max_i(k_i) + CN \qquad (5)$$

An even distribution of tasks produces the best time, so the first term would become
RM/N. Execution time, $T_e$, is minimum when

$$N = \sqrt{\frac{RM}{C}} \qquad (6)$$

The effective parallelism is reduced because of communication costs [ST87]. Although
communications are not completely serial, Chapter IV will show that a linear
communications cost model is also applicable to the hypercube when every processor
is required to communicate once with every other processor.
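
The minimum of the linear cost model follows from setting the derivative of RM/N + CN with respect to N to zero, which gives N = sqrt(RM/C). The sketch below, with illustrative names and sample values, checks this numerically.

```python
import math


def exec_time_linear(M: int, R: float, C: float, N: float) -> float:
    """Execution time under the linear communications cost model."""
    return R * M / N + C * N


def optimal_processors(M: int, R: float, C: float) -> float:
    """Processor count that minimizes exec_time_linear (may be fractional)."""
    return math.sqrt(R * M / C)


if __name__ == "__main__":
    M, R, C = 64, 100.0, 1.0
    n_best = optimal_processors(M, R, C)
    print("optimal N:", n_best)
    for N in (n_best - 1, n_best, n_best + 1):
        print(N, exec_time_linear(M, R, C, N))
```

In practice the result would be rounded to a cube size that actually exists (a power of two), which this sketch does not attempt.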


Hypercube Hardware Characteristics


Reed and Grunwald benchmarked the iPSC's transmission time between nearest
neighbor processors. Assuming that a message consists of a single packet, they modeled
the transmission time as

$$T_{sl} = t_l + L t_c \qquad (7)$$

where $t_l$ is the communications latency, L is the length of the message in bytes and
$t_c$ is the transmission time per byte. Using a least squares fit to their data, they
found that $t_l$ = 1.7 milliseconds and $t_c$ = 2.8 microseconds. Their evaluation did
not include intermediate node hand-off time for messages that require multiple links
[RE87]. Moore's thesis [MO87] and this thesis jointly duplicated this benchmark.
The results are compared in Chapter III.
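
As a small illustration of the nearest-neighbor model, the sketch below evaluates latency plus length times per-byte cost using the constants reported by Reed and Grunwald (1.7 milliseconds and 2.8 microseconds per byte). The function name and the millisecond units are choices made for this example.

```python
def single_link_time_ms(length_bytes: int,
                        latency_ms: float = 1.7,
                        per_byte_ms: float = 0.0028) -> float:
    """Nearest-neighbor transmission time in milliseconds for one packet."""
    return latency_ms + length_bytes * per_byte_ms


if __name__ == "__main__":
    for length in (0, 256, 1024):
        print(length, "bytes:", round(single_link_time_ms(length), 4), "ms")
```

With these constants the fixed latency dominates: a message must be roughly 600 bytes long before the per-byte term equals the latency.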

Wiley claimed that a hypercube can support a computation-to-communications
ratio of 10 to 1. The ratio is measured as the rate at which the nodes execute instructions
to the communications bandwidth [WI87]. For the iPSC, the nodes execute 1 million
instructions per second and the channel bandwidth is 10 megabits per second.
This implies that if the 10 to 1 ratio is not followed, communications channels will
become saturated. Wiley further stated that programmers need to keep this ratio
in mind when writing software for the hypercube [WI87]. Another computation-to-communication
ratio is derived in Chapter IV which may be more amenable to
programmers for determining the best sized hypercube for their application.

Summary 

The literature strongly suggests that the performance of a multiprocessor is
a result, in part, of the algorithm chosen to run on it. Clearly, the architectural
performance is dependent on the workload placed on the architecture. Stone's [ST87]
performance models use variables that describe the workload. That is, $RM$ describes
the total computational load and $(M^2 - \sum k_i^2)$ describes the message traffic load.
His models assume that the process can be partitioned in M tasks exhibiting the

III. Research Method


The purpose of this thesis is to determine the effects of the computational load,
message traffic load and their interaction on the speedup of a parallelized process
over five dimensions of the hypercube. In Chapter I, the independent variables that
describe the workload were identified and quantified. To determine their effects,
an actual hypercube was benchmarked using a workload that could be controlled.
The benchmark program placed a homogeneous workload on each processor. To
preclude the design and implementation of various workloads for the hypercube
so that all the independent variables could be controlled, a simulation model of
hypercube processing was constructed, verified and validated with the benchmark.
To study the effects of the workload characteristics, an experiment was designed
and the simulation model was exercised. Regression was then used to determine the
functional relationships of the independent variables. These results are presented in
Chapter IV.

Benchmark 

Intel's Personal Super Computer (iPSC) was benchmarked using a parallelized
matrix multiplication program. A useful program was selected instead of a kernel
that merely placed a workload on the hypercube so that results of the program could
be verified for correctness. A useful program also gives an indication of the overhead
associated with implementation details, such as how a process verifies receipt of its
messages. Since that overhead is a function of the particular implementation, it is
not studied in this thesis, but it cannot be ignored nonetheless.


Benchmark Program  The benchmark program squares an 8 x 8 matrix where
each node computes a submatrix, passes its results to every other node and verifies
that it has received the results from every other node. A copy of the program is
downloaded to the hypercube by the host, which is a front-end processor that serves
as a link between the hypercube and its users. Figure 3 is a high-level flowchart
representation of the nodes' program which is listed in Appendix A and explained
below.

1.  Each  node  opens  communications  channels  to  the  host  and  the  cube.  These 
are  logical  communications  channels  and  are  used  in  the  routines  that  handle 
message  transfers.  The  nodes  receive  a  message  from  the  host  telling  them  how 
many  processors  are  active  in  the  cube. 

2.  Each node computes the indices of the first and last elements of its submatrix
based on its node identification number and the number of active nodes in the
cube. Each node will compute 64/N elements where N is the number of active
processors.

3.  Each  node  marks  its  start  time. 

4.  Each  node  computes  its  submatrix  results  x  number  of  times.  Variations  in  x 
are  what  allow  the  computational  workload  to  be  controlled. 

5.  Each  node  marks  the  time  it  completes  its  computation. 

6.  Each node sends its results to every other active node in the cube. The nodes'
messages are uniquely identified by a different type. The node's identification
number is assigned by the program as the message type.

7.  At each node, a flag for each message type is stored in a vector. The flag
indicates whether or not the message has been received. The type identifies
where the message originated, thus indicating where in the results matrix
the incoming data needs to go, and a buffer pointer can be set to that location.
The vector of flags is scanned and the type of an unreceived message is
identified.


8.  Nodes  check  to  see  if  the  identified  message  type  is  available  for  receipt. 

9.  If the message type is available, the nodes receive the message directly into the
appropriate  place  in  the  results  matrix. 

10.  Each  node  uses  the  number  of  messages  it  expects  to  receive  as  a  flag.  If  all 
the  messages  are  not  received,  then  another  message  type  is  identified. 

11.  When  each  node  has  received  all  of  its  messages,  it  marks  the  stop  time  then 
records  its  times  to  the  system  log. 

12.  Each  node  sends  its  results  matrix  to  the  host  then  closes  its  communication 
channels. 
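
Step 2 of the node program can be sketched as below. The actual program is listed in Appendix A and is not reproduced here, so this is only an illustration under two assumptions: the 64 matrix elements are numbered in row-major order, and each node takes one contiguous block of 64/N elements. The function names are hypothetical.

```python
def submatrix_range(node_id: int, active_nodes: int, size: int = 8) -> tuple:
    """First and last flat indices of the elements computed by one node.

    Assumes row-major numbering and one contiguous block per node.
    """
    total = size * size
    if total % active_nodes != 0:
        raise ValueError("elements must divide evenly among the nodes")
    per_node = total // active_nodes
    first = node_id * per_node
    return first, first + per_node - 1


def compute_block(a: list, first: int, last: int) -> list:
    """Elements first..last of the square of matrix a (list of rows)."""
    size = len(a)
    block = []
    for flat in range(first, last + 1):
        row, col = divmod(flat, size)
        block.append(sum(a[row][j] * a[j][col] for j in range(size)))
    return block


if __name__ == "__main__":
    for node in range(4):
        print("node", node, "computes elements", submatrix_range(node, 4))
```

Because the cube sizes are powers of two up to 32, the 64 elements always divide evenly among the active nodes.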

Benchmark Experiment  The experiment was designed such that the total computational
load is varied over six dimensions of the hypercube. The single processor
execution times were recorded for determining the speedup performance of the other
dimensions. Since the total computational load could not be controlled directly in terms
of time, iterative recalculation of the matrix was used as the control mechanism. For
each size of the hypercube, the program was executed 5 times at each computational
load. The levels of the computational loads were expressed as the number of
times the elements were recalculated. Table 2 gives the levels of control used in the
benchmark. The message traffic load was controlled with the number of processors,
and the number of messages is an aggregate message count. In all, 270 observations
of execution time were recorded.
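
The counts in this design can be reproduced with a few lines. The sketch assumes that the aggregate message count is N(N - 1), since each of N nodes sends its results to the other N - 1 nodes, and that the 270 observations arise from nine load levels, six cube sizes and five repetitions; both readings are inferences from the text rather than statements in it.

```python
def aggregate_messages(nodes: int) -> int:
    """Total messages when every node sends once to every other node."""
    return nodes * (nodes - 1)


node_counts = [2 ** d for d in range(6)]        # cube dimensions 0 through 5
loads = [5, 7, 10, 16, 20, 30, 65, 90, 200]      # recalculation counts (x)
repetitions = 5

observations = len(loads) * len(node_counts) * repetitions

if __name__ == "__main__":
    for n in node_counts:
        print(n, "nodes:", aggregate_messages(n), "messages")
    print("observations:", observations)
```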

Data Collection  Each node recorded its computational and message processing
times in the system log. The computational time required of a given load for a cube
size was taken to be the mean of the individual processors' computational times. This is
reasonable since each processor had the same load: the computational times recorded
by the processors were all within 5 milliseconds of each other and the resolution of
the clock is also 5 milliseconds. The message processing time, however, was taken

Table 2. Benchmark Levels of Control

Computational Load (x)    Number of Nodes    Number of Messages
5                         1                  0
7                         2                  2
10                        4                  12
16                        8                  56
20                        16                 240
30                        32                 992
65
90
200

to  be  the  maximum  time  of  all  the  processors  in  the  cube.  This  is  also  reasonable 
since  the  process  was  not  considered  complete  until  the  last  message  was  received. 
Total  execution  time  is  the  sum  of  the  two  times. 
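
The aggregation rule described above (mean computational time plus maximum message processing time) can be written as a short sketch. The function name and the sample times are hypothetical.

```python
def total_execution_time(compute_times: list, message_times: list) -> float:
    """Mean per-node computational time plus the slowest node's message time."""
    if not compute_times or not message_times:
        raise ValueError("at least one node's times are required")
    mean_compute = sum(compute_times) / len(compute_times)
    return mean_compute + max(message_times)


if __name__ == "__main__":
    # Hypothetical times in milliseconds for a four-node cube.
    compute = [100, 105, 100, 95]
    messages = [10, 12, 30, 11]
    print(total_execution_time(compute, messages))
```

Taking the maximum for message processing reflects the completion rule: the process is finished only when the last node has received its last message.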

Simulation  Model 

A simulation model was constructed using the SLAM II simulation language
[PR86]. The model simulates a process running in parallel on a hypercube of a
chosen dimension ranging from 0 to 5. The computational load and message load are
balanced across all processors. Each node executes a specified number of bursts and
each burst is followed by a message being sent to any number of predetermined or
randomly chosen receivers.

Model Components  To construct the model, times for three components had
to be determined. The first component is the computational time per node, which
was assumed to be about 1/N th of the total computational load. The benchmark
confirmed this assumption. The second and third components are
interprocessor communications time and overhead due to implementation details.
Interprocessor communications time is addressed below; however, the last component
is addressed with model validation since it is related to validation and calibration of
the overall simulation model.

Equation 7 found in Chapter II estimates interprocessor communication times
for nearest neighbors. Since the simulation model simulates both nearest-neighbor
and non-nearest-neighbor communications, the use of Equation 7 in the simulator
was insufficient. The iPSC uses a packet-switched network with predetermined routing,
so non-nearest-neighbor communications can be thought of as a series of nearest-neighbor
transmissions. Along the sender/receiver path, intermediate nodes perform
the store-and-forward function. Equation 7 can be modified to model interprocessor
communications over multiple links

$$T_{ml} = t_l + H L t_c + I t_i \qquad (8)$$

where H is the number of hops or links traversed, I is the number of intermediate
nodes visited and $t_i$ is the time required for the node to forward the message. H
and I are directly related since the number of intermediate nodes visited is one less
than the number of hops in the sender/receiver path. When there is a single link in
the path, Equation 8 reduces to Equation 7.
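
A minimal sketch of the multiple-link model follows. It computes latency plus hops times length times per-byte cost plus one forwarding delay per intermediate node, with the number of intermediate nodes taken as hops minus one. The function and parameter names, and the sample values, are illustrative.

```python
def multi_link_time(length_bytes: int, hops: int,
                    t_l: float, t_c: float, t_i: float) -> float:
    """Transmission time over `hops` links with store-and-forward nodes.

    t_l: latency, t_c: time per byte per link, t_i: forwarding time per
    intermediate node. All times must share the same unit.
    """
    if hops < 1:
        raise ValueError("a transmission crosses at least one link")
    intermediates = hops - 1
    return t_l + hops * length_bytes * t_c + intermediates * t_i


if __name__ == "__main__":
    # With one hop the forwarding term vanishes (the nearest-neighbor case).
    print(multi_link_time(4, 1, t_l=1.0, t_c=0.25, t_i=0.5))
    print(multi_link_time(4, 3, t_l=1.0, t_c=0.25, t_i=0.5))
```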

To find the actual message transmission times, a benchmark program was executed
on the iPSC. One message was passed back and forth between a sender/receiver
pair 100 times (200 transmissions). Two hundred transmissions were chosen to
overcome the resolution of the clock. The transmissions were done in a sterile
environment: first, there were no overlapped transmissions of the message, since
both nodes were required to wait on the message before it could be returned to
the other, and second, there were no other communications in the cube, nor were
there any other processes running. Twenty messages of varying sizes (up to 1024
bytes) were passed between 31 sender/receiver pairs; thus, the data set consisted
of 620 data points.

Figure 4. Plot of Message Transmission Times

Figure 4 shows the linear relationship between messages requiring the same
number of hops. Using SAS, Equation 8 was estimated from the data, producing

$$T_{ML}\ \text{(ms)} = 1.1232 + 0.0008968\,L H + 0.483\,I \qquad (9)$$

The  coefficient  of  determination  for  this  model  is  .9939. 
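
As an illustration only, Equation 9 can be evaluated directly. The sketch below is not part of the thesis software; the function name `msg_time_ms` and the macro names are hypothetical, and the only data used are the three coefficients printed in Equation 9.

```c
#include <assert.h>

/* Coefficients copied from Equation 9; all times are in milliseconds. */
#define LATENCY_MS       1.1232     /* fixed start-up cost per message        */
#define PER_BYTE_HOP_MS  0.0008968  /* transmission time per byte per link    */
#define FORWARD_MS       0.483      /* forwarding cost per intermediate node  */

/* Estimated transmission time for a message of `bytes` bytes that
 * traverses `hops` links. The number of intermediate nodes is taken
 * to be hops - 1, as stated for Equation 8. */
double msg_time_ms(int bytes, int hops)
{
    assert(bytes >= 0 && hops >= 1);
    return LATENCY_MS
         + PER_BYTE_HOP_MS * bytes * hops
         + FORWARD_MS * (hops - 1);
}
```

For a 1024-byte message, `msg_time_ms(1024, 1)` evaluates to roughly 2.04 ms for nearest neighbors and `msg_time_ms(1024, 3)` to roughly 4.84 ms over three links.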

In comparison to the results obtained by Reed and Grunwald, the message
latency is slightly lower than the 1.7 milliseconds they reported. Even more
surprising was the difference in transmission time per byte: 0.9 microsecond is
significantly faster than the 2.8 microseconds reported. Later research by Reed
yielded a new estimate of Equation 7, with 0.706 millisecond for latency and
0.519 microsecond for transmission time per byte [RE88]. He attributed the
difference to a revision in the node operating system, which handles the message
processing.


Model Construction. The SLAM II source code modelling parallel processing
on a hypercube is listed in Appendix B. The basis for modelling communications in
the hypercube was Equation 9, since non-nearest-neighbor communications can be
thought of as a series of nearest-neighbor communications along the
sender/receiver path. The time spent at each intermediate node is the coefficient
of the third term in Equation 9, while the transmission time per link is the size
of the message multiplied by the coefficient of the second term.

Each node in the cube was modelled as a unique RESOURCE. The communications
links outbound from a node were modelled with a single-server ACTIVITY
preceded by a QUEUE. Usually, each outbound link would be modelled with its
own server and queue, but the iPSC has a peculiar hardware characteristic. If more
than one physical channel transmits at a time, packets are lost; therefore, only one
channel is allowed to transmit at a time [IN86b]. For packet transfers, the iPSC uses
a predetermined routing scheme based on the logical exclusive-or operation of the
nodes' identification numbers. This routing algorithm was easily implemented with a
FORTRAN function that is visible to SLAM II. Communications has priority over
normal node processing, so packets arriving at a node have the ability to pre-empt
a node's processing. The implementation of this condition was trivial in SLAM II:
a packet is allowed to PREEMPT a task currently utilizing the node RESOURCE.
SLAM took care of queuing the messages that arrive at the nodes for processing.
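
The exclusive-or routing idea can be sketched in a few lines of C. This is not the FORTRAN function from Appendix B; the names `hop_count` and `next_node` are hypothetical, and the order in which differing address bits are corrected (lowest-order bit first) is an assumption, since the text above only states that the route is derived from the exclusive-or of the node identification numbers.

```c
#include <assert.h>

/* Number of links between two nodes: the count of bit positions in
 * which the two node identification numbers differ. */
int hop_count(unsigned src, unsigned dst)
{
    unsigned diff = src ^ dst;
    int hops = 0;

    while (diff != 0) {
        hops += (int)(diff & 1u);
        diff >>= 1;
    }
    return hops;
}

/* Next node on the path from `cur` to `dst`.
 * ASSUMPTION: the lowest-order differing bit is corrected first. */
unsigned next_node(unsigned cur, unsigned dst)
{
    unsigned diff = cur ^ dst;

    assert(diff != 0);                    /* message already delivered */
    return cur ^ (diff & (~diff + 1u));   /* flip lowest set bit of diff */
}
```

Under this assumption, a message from node 0 to node 5 visits node 1 and then node 5, so `hop_count(0, 5)` is 2 and there is one intermediate node.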

The concurrent activities of a parallel process on a two-dimensional hypercube
are shown in Figure 5. Activities that occur in the unshaded areas are concurrent.
The shaded areas represent the activities that required the use of the node; for
example, a node could process messages or it could process a task, but it could not
do both simultaneously. Figures 6 and 7 show the flow of events in the simulation
model, which occur as follows:


Figure 5. Concurrent Activities

1. A process enters the cube.

2. The process is partitioned into N tasks; N can be {1, 2, 4, 8, 16, 32}.

3. Each task entity is assigned an identification number, which is conveniently the
identification number of the node to which it is assigned.

4. The start time of the process is recorded.

5. Tasks wait until their node becomes available.

6. Tasks are processed for some length of time. Burst lengths are independent
random variables from the same distribution.

7. Following the burst, a task communicates its results to some specified number
of other tasks by replicating itself into a message entity for each message it
sends.

8. Once the last message is sent, the task releases the node.

9. If a task is not complete, it waits for the node to which it is assigned and it is
processed again. When the task is completed, the task entity is terminated.

Meanwhile:

10. A message entity is assigned the next node it must go to en route to its
destination.

11. A message entity enters the transmit queue at its current node.

12. A message is transmitted: from Equation 9, the transmission time is 0.9
microseconds for every byte of information in the message.

13. At the next node, a message pre-empts the node's task processing. If this
node is the destination, the node receives the message. If this node is not
the destination, intermediate-node processing occurs, the next node in the
path is assigned and the message continues to traverse the network.

14. The node is released to return to task processing.

15. The message is counted and terminated if it is not the last one.

16. When the last message is received, the completion time is recorded and the
process is terminated.

Figure 6. Simulation Flow of Events (Part A)

Model Validation. Before the model could be validated with the benchmark, it
had to be calibrated to the exact implementation of the benchmark program. There
was overhead introduced into the benchmark program from measuring the clock
and checking the receipt of messages. Message checking ensured that all messages
were received and the data was stored in the appropriate place in the matrix. This
additional workload required computation time that must be accounted for in the
simulation model.

To find the calibration time, the actual time required to execute message checking
was measured in the benchmark program for all dimensions of the hypercube.
The time required for the process to actually receive the message was subtracted out,
thus leaving the overhead associated with checking flags. Using linear regression, the
calibration function for N > 2 was estimated to be

$$C = 5 + 7.5N \qquad (10)$$

where C is the amount of time, in milliseconds, required to perform overhead and
N is the number of nodes in the cube. For the two-node case, the overhead was
negligible.
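
A minimal sketch of how the calibration function could be applied is shown below. The function name `calibration_ms` is hypothetical, and treating the overhead as exactly zero for one or two nodes is an assumption based on the statement that the two-node overhead was negligible.

```c
#include <assert.h>

/* Message-checking overhead in milliseconds for a cube of `nodes`
 * nodes, following Equation 10 for N > 2.
 * ASSUMPTION: the overhead is taken as zero for N <= 2. */
double calibration_ms(int nodes)
{
    assert(nodes >= 1);
    if (nodes <= 2)
        return 0.0;
    return 5.0 + 7.5 * nodes;
}
```

For example, `calibration_ms(4)` gives 35 ms and `calibration_ms(32)` gives 245 ms of overhead to be added to the simulated execution time.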

The calibration function was incorporated into the model for validation purposes.
The simulation was exercised with four uniprocessor loads, which were taken
from actual uniprocessor times observed in the benchmark experiment. The
individual node processing times were taken from a normal distribution with μ
computed as the uniprocessor load divided by the number of nodes and σ equal to
2.5 milliseconds, since the benchmark showed that the computational times of the
individual nodes were all within 5 milliseconds of each other. The total processing
time for each node was completed in one burst, which simulates the benchmark
program. For each uniprocessor load, five runs were made for cube sizes of 1
through 5, for a total of 100 simulation times. Table 3 shows how the 100
simulation times were collected. The uniprocessor loads are also expressed in
terms of the number of recalculations in the benchmark program.

Table 3. Uniprocessor Loads Used for Validation

      Uniprocessor Load
  Raw Time   Recalculations     Number of Nodes
     200            5                  2
    1205           30                  4
    3615           65                  8
    8010          200                 16
                                      32

The five runs for a given uniprocessor load and cube size were averaged, as were
the five runs of the benchmark program corresponding to the same uniprocessor
load and cube size. The first research hypothesis listed in Table 1 was tested with
a paired difference t test at a significance level of 0.1. The raw times were used to
perform the test, since the speedup within each pair is relative to the same
uniprocessor time.

Experiment  Design 

To test the second and third research hypotheses listed in Table 1, an experiment
was designed which exercised the simulation model without the calibration
function. The uniprocessor load was varied across two levels. The message traffic
load was varied by introducing two more variables that quantify the amount of
message traffic. These two variables were the number of bursts and the number of
messages sent per burst, which was expressed as a function of the number of nodes.
Two levels of each were selected.

Table 4. Control Variables

  Uniprocessor Time   Nodes   Bursts   Receivers
        1000            2        1       N / 2
       20000            4        5       N - 1
                        8
                       16
                       32

The general linear model for total execution time is

$$T_{CL} = \mu + \frac{U}{N} + N B R + \text{error} \qquad (11)$$

where μ is the experimental average, U is the uniprocessor time, N is the number
of nodes, B is the number of bursts and R is the number of receivers. The general
linear model for speedup is

$$\log S_{CL} = \log U + \log T_{CL} + \text{error} \qquad (12)$$

Levels of the Control Variables. The control variables were set to the levels shown
in Table 4. Each experimental unit consisted of a uniprocessor time, a cube size, a
level of bursts and a level of receivers. In the case where N = 2, both levels of R
were the same. Excluding the duplicated units, there was a total of 36 experimental
units. The computational time for each node was randomly selected from a normal
distribution. In the case of R = N - 1, every node sent to every other node, but
when R = N/2, the receiving nodes were randomly selected. When the number of
bursts was 1, the burst length was the node's computational time, but when the
number of bursts was 5, the burst duration was randomly selected from a normal
distribution.
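
The count of experimental units can be checked by enumerating the levels in Table 4. The sketch below is illustrative only; the structure `unit` and the function `enumerate_units` are hypothetical names, and the rule for dropping duplicates (the two receiver levels coincide when N = 2) is taken from the paragraph above.

```c
#include <assert.h>
#include <stddef.h>

/* One experimental unit: a combination of control-variable levels. */
struct unit {
    int load_ms;    /* uniprocessor time       */
    int nodes;      /* cube size N             */
    int bursts;     /* number of bursts B      */
    int receivers;  /* number of receivers R   */
};

/* Writes up to `cap` units into `out` (which may be NULL) and returns
 * the total number of distinct units. */
int enumerate_units(struct unit *out, int cap)
{
    static const int loads[]  = {1000, 20000};
    static const int nodes[]  = {2, 4, 8, 16, 32};
    static const int bursts[] = {1, 5};
    int u, k, b, r, n = 0;

    for (u = 0; u < 2; u++)
        for (k = 0; k < 5; k++)
            for (b = 0; b < 2; b++) {
                int N = nodes[k];
                int r_levels[2];
                int r_count;

                r_levels[0] = N / 2;
                r_levels[1] = N - 1;
                /* When N = 2 both receiver levels equal 1: keep one. */
                r_count = (r_levels[0] == r_levels[1]) ? 1 : 2;

                for (r = 0; r < r_count; r++) {
                    if (out != NULL && n < cap) {
                        out[n].load_ms   = loads[u];
                        out[n].nodes     = N;
                        out[n].bursts    = bursts[b];
                        out[n].receivers = r_levels[r];
                    }
                    n++;
                }
            }
    return n;
}
```

Calling `enumerate_units(NULL, 0)` returns 36: the full factorial has 2 x 5 x 2 x 2 = 40 combinations, and the 4 duplicated combinations at N = 2 are removed.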

Data Collection. For each experimental unit, 10 runs were made. To test
the third research hypothesis, ten additional runs were made for the experimental
units that include 5 bursts, but the burst lengths were randomly chosen from an
exponential distribution. Table 5 summarizes the entire experiment design.


Table 5. Experiment Design

         Burst                      Uniprocessor   Number of      Number of
  Runs   Length                     Time           Nodes          Bursts      Receivers
   10    Normally Distributed       1000, 20000    2              1, 5        N - 1
   10    Normally Distributed       1000, 20000    4, 8, 16, 32   1, 5        N / 2, N - 1
   10    Exponentially Distributed  1000, 20000    2              5           N - 1
   10    Exponentially Distributed  1000, 20000    4, 8, 16, 32   5           N / 2, N - 1

Summary

This chapter explained the procedure followed in this research. A hypercube
was benchmarked using a matrix multiplication algorithm. A discrete event
simulation of a parallel process running on a hypercube was then constructed and
validated with the benchmark data. Finally, an experiment was designed that
exercised the simulation model so that the functional relationship between
workload characteristics and speedup characteristics could be determined.


This  chapter  presents  the  results  of  the  benchmark  and  the  discrete  event 
simulation  experiment  and  evaluates  the  research  hypotheses  listed  in  Table  1. 

Evaluation of the First Research Hypothesis

Results of the benchmark experiment are given in Appendix C. Table 6 shows
the average times for selected computational loads, which are expressed in terms of
the number of recalculations. The results of the simulation model with the
calibration function incorporated are listed in Appendix D. Table 6 also shows the
average times obtained from the simulation. Figure 8 shows the standardized
deviations of the simulation times from the benchmark times for the uniprocessor
loads listed in Table 6. The test statistic for the paired difference t test was 0.22;
therefore, the first research hypothesis was not rejected, and it was concluded that
the simulation model was an accurate representation of hypercube performance for
the matrix multiplication workload.
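
For readers unfamiliar with the paired difference t test, the statistic is the mean of the pairwise differences divided by its standard error. The sketch below is a generic illustration, not the SAS procedure used in the thesis; `paired_t` and `newton_sqrt` are hypothetical names, and the square root is computed with a small Newton iteration only to keep the example free of a math-library dependency.

```c
#include <assert.h>

/* Square root by Newton's method (avoids linking the math library). */
static double newton_sqrt(double x)
{
    double r = (x > 1.0) ? x : 1.0;   /* start at or above the root */
    int i;

    if (x <= 0.0)
        return 0.0;
    for (i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Paired difference t statistic for n pairs (a[i], b[i]):
 * t = mean(d) / (sd(d) / sqrt(n)), where d[i] = b[i] - a[i].
 * The differences must not all be equal. */
double paired_t(const double *a, const double *b, int n)
{
    double mean = 0.0, ss = 0.0;
    int i;

    assert(n >= 2);
    for (i = 0; i < n; i++)
        mean += b[i] - a[i];
    mean /= n;

    for (i = 0; i < n; i++) {
        double d = (b[i] - a[i]) - mean;
        ss += d * d;
    }
    assert(ss > 0.0);

    /* Standard error = sqrt(sample variance / n). */
    return mean / newton_sqrt(ss / (n - 1) / n);
}
```

With the benchmark averages as `a` and the simulation averages as `b`, a statistic near zero indicates that the mean difference is small relative to its variability, which is the sense in which the hypothesis above was not rejected.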


Table 6. Benchmark Versus Simulation Times

                                                 Total Time (ms)
  Recalculations   Uniprocessor Time   Nodes   Benchmark   Simulation
         5                200             2        104         104
                                          4        104          95
                                          8        130         112
                                         16        169         184
                                         32        348         351
        30               1205             2        606         607
                                          4        349         346
                                          8        255         238
                                         16        224         247
                                         32        398         382
        65               3615             2       1812        1814
                                          4        944         950
                                          8        533         540
                                         16        391         398
                                         32        445         458
       200               8010             2       4009        4009
                                          4       2056        2048
                                          8       1095        1089
                                         16        665         672
                                         32        591         595

Figure 8. Comparison of Benchmark and Simulation Times (standardized deviation
versus number of nodes, for 5, 30, 65 and 200 recalculations)


Benchmark Results. From the benchmark data contained in Appendix C, two
equations were estimated. Figure 9 shows the mean computational time,
communications time and total execution time for 10 recalculations of the matrix.
From 2 to 32 nodes, the message processing time increased almost linearly with the
number of nodes, while the computation time decreased inversely with the number
of nodes. This graph suggests a linear communication cost model similar to Stone's
[ST87]. The equation for the total execution time was estimated to be

$$T_b = -4.7 + \frac{U}{N} + 10.64N \qquad (13)$$

The coefficient of determination for this model is .9998. The graph also shows that
the message processing time for a single node is zero and negligible for two nodes.
This accounts for the negative constant in the model, which is required to fit the
rest of the data.

This model is similar to Stone's, since the uniprocessor time is, in terms of his
variables, RM. Equation 13 can also be minimized with respect to N by setting the
first derivative equal to zero

$$\frac{dT_b}{dN} = -\frac{U}{N^2} + 10.64 = 0 \qquad (14)$$

and solving for N

$$N = \sqrt{\frac{U}{10.64}} \qquad (15)$$

Although derived in a different manner, the relationship between the optimal level
of N and both the uniprocessor time and the communications time is the same
relationship that was given by Stone in Equation 6. The communications cost of
10.64 milliseconds per node included the actual communication times and the
overhead associated with checking the receipt of messages, as discussed earlier.
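
Because a hypercube only comes in sizes that are powers of two, the continuous optimum is of limited direct use. As a hedged illustration, the sketch below evaluates the fitted benchmark model (total time = -4.7 + U/N + 10.64N) for each available cube size and returns the best one. The names `exec_time_ms` and `best_nodes` are hypothetical, and restricting the search to 2 or more nodes is an assumption made because the model was fitted to multi-node data.

```c
#include <assert.h>

/* Fitted benchmark model: total execution time in milliseconds for
 * a uniprocessor load of `u_ms` spread over `nodes` nodes. */
double exec_time_ms(double u_ms, int nodes)
{
    assert(nodes >= 1);
    return -4.7 + u_ms / nodes + 10.64 * nodes;
}

/* Cube size (2, 4, ..., 2^max_dim nodes) with the smallest modelled
 * execution time. ASSUMPTION: the single-node case is excluded. */
int best_nodes(double u_ms, int max_dim)
{
    int best = 2, n = 2, d;

    assert(max_dim >= 1);
    for (d = 2; d <= max_dim; d++) {
        n *= 2;
        if (exec_time_ms(u_ms, n) < exec_time_ms(u_ms, best))
            best = n;
    }
    return best;
}
```

For the smallest validation load of 200 ms, `best_nodes(200.0, 5)` returns 4, while for the 8010 ms load it returns 32; larger loads justify larger cubes before communication cost dominates.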

Although Equation 13 accurately models the total execution time in terms of
the uniprocessor load and the number of nodes, information about the message
traffic load is embedded in the number of nodes and the assumption that every
node sends once to every other node. Equation 11 allows the assumption about the
message traffic load to be relaxed, since nodes can send one or more messages to a
subset of the cube. Equation 11 was estimated to be

$$T_{CL} = 16.7 + \frac{U}{N} + 0.03\,N B R \qquad (16)$$

For the benchmark program, B = 1 and R = N - 1. In this model, the three primary
independent variables which describe the workload were used. The coefficient
of determination for this model is .9992. Figure 10 compares the predictive power
of the two models. Both models indicate that the computational time decreases
with the number of nodes, but communication time increases as more nodes are
added. Figure 11 shows the speedup curves for four different uniprocessor loads.
The point where each curve turns down indicates which level of N minimizes
Equations 13 and 16 relative to the uniprocessor load.

Figure 10. Comparison of Models (Hypercube Benchmark, matrix recalculated 16
times; total time versus number of nodes for Actual, Model 1 and Model 2)

Computation to Communications Ratio. The benchmark data can be used to
derive another computation to communications ratio. The speedup achieved in the
benchmark in terms of N was

$$S_b = \frac{U}{\frac{U}{N} + 10.64N - 4.7} \qquad (17)$$

If $S_b$ is set equal to 80% of the ideal speedup, then solving for U yields

$$U = 42.4N^2 - 18.8N \qquad (18)$$

Therefore, to achieve 80% of ideal speedup, the ratio of computational time per node
to total communications time is approximately 4 to 1, since

$$\frac{42.4N^2 - 18.8N}{10.64N^2} \approx 4$$

To reach 80% of the ideal speedup in the case of a balanced load over 4 or
more processors, the computational time per node must be 4 times as great as the
total communications time. Said differently, the ratio of uniprocessor time to total
communications time must be 4N to 1. For example, if each node sends one message
to every other node after completing its computation and 4 nodes were used, then
the uniprocessor time must be about 660 milliseconds to obtain a speedup of 3.2;
however, for 8 nodes, the uniprocessor time must be about 2700 milliseconds to
achieve a speedup of 6.4. As more communication time is required, either by nodes
sending more messages or by more nodes being added to the system, the
uniprocessor time must also increase to maintain the same speedup. When N = 2,
the communications time is not significant, so near ideal speedup will be achieved.
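
The required uniprocessor load for any target fraction of ideal speedup follows from the speedup expression above: setting the speedup to f N and solving for U gives U = f N (10.64N - 4.7) / (1 - f). The sketch below is illustrative only; the names `speedup` and `required_load_ms` are hypothetical. Because it solves the fitted expression exactly, it gives somewhat smaller loads than the rounded 4N-to-1 rule of thumb quoted in the text.

```c
#include <assert.h>

/* Benchmark speedup model: U / (U/N + 10.64N - 4.7). */
double speedup(double u_ms, int nodes)
{
    assert(nodes >= 1);
    return u_ms / (u_ms / nodes + 10.64 * nodes - 4.7);
}

/* Uniprocessor time (ms) at which the model predicts a speedup of
 * `fraction` times the ideal speedup N. */
double required_load_ms(double fraction, int nodes)
{
    assert(fraction > 0.0 && fraction < 1.0 && nodes >= 2);
    return fraction * nodes * (10.64 * nodes - 4.7) / (1.0 - fraction);
}
```

As a usage example, `speedup(required_load_ms(0.8, 8), 8)` evaluates to 6.4, which is 80% of the ideal speedup on 8 nodes.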


Evaluation  of  the  Second  Research  Hypothesis 


Since the first hypothesis was not rejected, the data obtained from the
simulation model was analyzed with confidence that it was representative of the
hypercube architecture. The actual times obtained from the simulation are
optimistic, since the model does not account for the processing time required for a
task to verify receipt of messages. The simulation can, however, show relative
speedup for varying workloads.

The average execution times obtained from the simulation when the burst
lengths are normally distributed are given in Table 7. Appendix E contains the
complete data set. Figures 12 and 13 compare the speedup achieved for each level
of B and R for the 1000-millisecond and 20,000-millisecond loads, respectively.




Table 7. Simulation Results

                          Average Execution Time
  Nodes   Receivers   1000 ms load   20,000 ms load
     4        2           273            5325
     8        4           151            2757
    16        8            95            14??
    32       16            91             760

     2        1           517           10270
     4        3           275            5327
     8        7           157            2735
    16       15           113            1434
    32       31           135             805

     4        2           291            5313
     8        4           192            2797
    16        8           188            1506
    32

Figure 13. Plot of Speedup for Total Computational Load of 20,000 ms (series:
B = 1, R = N - 1; B = 1, R = N/2; B = 5, R = N - 1; B = 5, R = N/2)

and regression analysis revealed a functional relationship between the independent
variables, the second research hypothesis was rejected. Equation 19 indicates that
the total computational load, number of nodes and message traffic load all affect the
speedup of a process run in parallel on the hypercube.


Evaluation of the Third Research Hypothesis

The total execution times collected from the simulation when the burst lengths
were exponentially distributed are given in Appendix E. For every experimental unit
at R = N - 1, the results were identical to those for the normally distributed burst
lengths. When R = N/2, there were very slight differences in the total execution
time. The raw data indicated that there was no difference in the total execution
time of a process explained by the burst times, so the third hypothesis was not
rejected.


V.  Conclusions 


General

The communication costs incurred by parallelization reduce the speedup of
a process run on a hypercube. Although this is not a novel idea, this thesis
presented the functional relationship between the workload characteristics that
affect the speedup of a process on a hypercube architecture. The methodology
presented here is directly applicable to any architecture.

Research Results

The speedups achieved in both the benchmark and the calibrated simulation
were not significantly different. The data listed in Table 6 supported the first research
hypothesis. Since the simulation model could be validated with the benchmark
program, it was assumed to be an accurate representation of hypercube performance.

The speedups obtained from the simulation changed with the workloads placed
on the cube. The functional relationship between the workload characteristics and
the total execution time was given in Equation 19; therefore, the second hypothesis
was rejected. When the experiment was repeated with the exponentially distributed
burst lengths, the data was not significantly different, so the third hypothesis was
not rejected.

Benchmarking Versus Simulation

The benchmark of an actual hypercube provided some insight into the speedup
phenomenon. Creating controllable workloads is cumbersome, so a simulation model
of hypercube processing was constructed, verified and validated. Unlike
benchmarking, the simulation allowed more flexibility and control of the workload
characteristics. The model did not account for overhead processing innate to an
actual implementation of a parallelized algorithm. Although benchmarking captures
the overhead processing, it will change from algorithm to algorithm and with the
chosen implementation. As shown by the benchmark, the overhead processing cannot
be ignored, since it also affects speedup. Simulation, on the other hand, provided a
means for studying the performance of the architecture in terms of workload
characteristics by removing the implementation and algorithm variables. Simulation
is useful for comparing potential algorithms implemented on the hypercube. This
requires the algorithm to be described only in terms of the workload it places on the
nodes and the communications network.

Impact  of  Workload  on  Speedup 

The results obtained from the simulation were not surprising. Clearly, if the
computational workload per node is large compared to the communication time
required, then speedups will be near ideal, as shown in Figure 13. At 32 nodes, the
speedup was about 28 when the message load was at the lowest level, but the speedup
dropped to approximately 16 when the message load was at the highest level. At 16
nodes, there was little difference in speedup, since the total message traffic load at
all four levels was small compared to the computational load on each node. When
the total computational load was 1000 milliseconds, the impact of message traffic
load was felt at 8 nodes, as shown in Figure 12. All the data collected, either from
the simulation or from the benchmark, supported a computational time per node to
total communications time ratio of 4 to 1.

The results obtained when the burst lengths were exponentially distributed
suggest that as long as the total computational workload and the message traffic
load are balanced, the points in time when computation and message processing
occur have no effect on the speedup. This means that the amount of workload
placed on each processor is the same, but the behavior of each workload can be very
different. Assuming that process synchronization is not an issue, this finding implies
that the goal of a decomposition strategy should be to balance the load without
regard to homogeneous behavior. Process synchronization was not modelled, so
balancing some workloads may impose additional time for synchronization.

Suggestions  for  Further  Research 

This thesis has shown the effect of the message traffic load on speedup. Previous
research has shown that speedup is degraded by an unbalanced computational
workload. Another issue to study would be the effects of an unbalanced message
traffic load and a balanced computational load on the speedup of a process run
in parallel. Perhaps slight imbalances in computational loads could be offset with
complementary imbalances of message loads. The message traffic generated when
the burst lengths were normally distributed was not at a high enough intensity to
saturate the network. How would speedup be affected if the assumption about
single-packet messages were loosened to allow multiple packets, so that the message
traffic load could saturate the network? And could network saturation be overcome
by exponentially distributed burst lengths? The use of simulation opens a door for
studying the impact various process behaviors have on speedup for any architecture.

Summary 

This thesis has successfully demonstrated the use of benchmarking and
simulation to determine the effects of the workload on the speedup of a process run
in parallel on the hypercube architecture. Additionally, simulation was shown to be
a viable tool for investigating the speedup phenomenon by providing flexibility and
easy control of the independent variables.


Appendix  A.  Benchmark  Program 


This  appendix  contains  the  listing  of  the  node  program  used  to  benchmark 
the  iPSC.  The  second  file  included,  “declare.h”  contains  declarations  used  by  both 
the  nodes  and  the  host,  and  is  listed  after  the  program.  A  copy  of  this  program  is 
downloaded  by  the  host  to  each  node  in  the  cube. 

#include "/usr/ipsc/lib/cnode.def"
#include "declare.h"

/* global variables */

int   my_pid, my_node;      /* local process and node number */
int   nprocs;               /* number of processors in cube */
int   host_chan;            /* channel for host-node communication */
int   node_chan;            /* channel for node-node communication */
int   cnt;                  /* incoming message size */
int   fr_node;              /* node message came from */
int   fr_pid;               /* process ID that message came from */
int   msg_length;           /* outgoing message size */
long  clock();              /* clock reading */
long  start_time;           /* starting time */
long  stop_time;            /* time process finishes completely */
long  end_mult;             /* time process finishes computing matrix */
char  msgbuf[80];           /* buffer for messages to system log file */
float result[SIZE][SIZE];   /* resulting matrix */


main ()
{
    setup ();                /* open communication channels */
    multiply ();             /* compute matrix and pass results */
    sendw (host_chan, UPLOAD, result,
           SIZE * SIZE * sizeof (float), HOST,
           HOST_PID);        /* send results to host */
    cclose (host_chan);
    sprintf (msgbuf, " %ld %ld", stop_time - start_time,
             end_mult - start_time);
    syslog (my_pid, msgbuf); /* total time, cpu time */
}


/* Open communication channels and receive size of cube */
/* from the host.                                       */

setup ()
{
    my_pid = mypid ();
    my_node = mynode ();
    host_chan = copen (my_pid);
    node_chan = copen (my_pid);
    recvw (host_chan, PARAM, &nprocs, PARAM_MSG_SIZE,
           &cnt, &fr_node, &fr_pid);
    msg_length = sizeof (float) * (SIZE * SIZE) / nprocs;
}


/* Compute the matrix. */

multiply ()
{
    int   msg_rec[32];   /* flag vector for message status   */
    int   msg_type;      /* next two variables are used to   */
    int   ans;           /* check the status of messages     */
    int   count;         /* number of messages expected      */
    int   node;          /* destination of a message         */
    int   i, j, k;       /* loop counters and matrix indices */
    int   recomp;        /* loop counter for recalculations  */
    int   f_row;         /* index of first row of submatrix  */
    int   l_row;         /* index of last row of submatrix   */
    int   cols;          /* number of columns in submatrix   */
    int   f_col;         /* index of first column            */
    int   l_col;         /* index of last column             */
    float *bufptr;       /* place to put incoming results    */
    float temp;          /* temporary storage used for       */
                         /* recalculating matrix             */

   /*  Compute the indices of the submatrix based on the     */
   /*  number of processors and node's identification number. */

   cols  = SIZE * SIZE / nprocs;
   f_row = my_node * SIZE / nprocs;
   l_row = ((my_node + 1) * SIZE - 1) / nprocs;
   f_col = (my_node * cols) % SIZE;
   l_col = ((my_node + 1) * cols - 1) % SIZE;
   bufptr = &result[f_row][f_col];
   temp = 0.0;

   start_time = clock();

   for (recomp = 0; recomp < CPU_LOAD; recomp++)
      for (i = f_row; i <= l_row; i++)
         for (j = f_col; j <= l_col; j++)
            {for (k = 0; k < SIZE; k++)
                temp = temp + MATRIX[k][j] * MATRIX[i][k];
             result[i][j] = temp;
             temp = 0.0;
            }

   end_mult = clock();

   for (node = 0; node < nprocs; node++)
      if (node != my_node)
         send (node_chan, my_node, bufptr, msg_length,
               node, NODE_PID);

   for (i = 0; i < nprocs; i++)
      msg_rec[i] = 1;

   count = nprocs - 1;
   msg_type = 0;
   do
      {if (msg_rec[msg_type] && (msg_type != my_node))
          {ans = probe (node_chan, msg_type);
           if (ans >= 0)
              {bufptr = &result[msg_type*SIZE/nprocs]
                               [(msg_type * cols) % SIZE];
               recv (node_chan, msg_type, bufptr,
                     msg_length, &cnt, &fr_node, &fr_pid);
               count--;
               msg_rec[msg_type] = 0;
              }
          }
       msg_type = (msg_type + 1) % nprocs;
      }
   while (count);
}
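
The index arithmetic at the top of multiply() can be checked on its own. The following is a minimal sketch, not part of the benchmark: the struct `block` and the function `partition` are hypothetical names, SIZE is fixed at 8 as in the header file, and nprocs is assumed to be a power of two between 1 and 32.

```c
#include <assert.h>

#define SIZE 8   /* size of square matrix, as in the benchmark header */

/* Hypothetical container for one node's share of the result matrix. */
struct block {
    int cols;    /* result elements per node (the listing calls this cols) */
    int f_row;   /* index of first row */
    int l_row;   /* index of last row */
    int f_col;   /* index of first column */
    int l_col;   /* index of last column */
};

/* Same integer arithmetic as multiply(), for node my_node of nprocs. */
struct block partition(int my_node, int nprocs)
{
    struct block b;
    b.cols  = SIZE * SIZE / nprocs;
    b.f_row = my_node * SIZE / nprocs;
    b.l_row = ((my_node + 1) * SIZE - 1) / nprocs;
    b.f_col = (my_node * b.cols) % SIZE;
    b.l_col = ((my_node + 1) * b.cols - 1) % SIZE;
    return b;
}
```

With 4 nodes, node 1 receives rows 2 to 3 and all eight columns; with 32 nodes, node 5 receives row 1, columns 2 to 3. Up to 8 nodes each node owns whole rows, and beyond that each row is split between several nodes.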




/*  message types  */

#define PARAM            40            /*  host-to-node: cube dimension  */
#define UPLOAD           70            /*  node-to-host: upload results  */
#define PARAM_MSG_SIZE   sizeof (int)

/*  cube definitions  */

#define HOST_PID    1
#define NODE_PID    1
#define ALL_NODES   -1
#define HOST        0x8000

/*  parameter definitions  */

#define SIZE        8                  /*  size of square matrix         */
#define CPU_LOAD    5                  /*  number of recalculations      */
#define nextarg     { argc--; argv++; }   /*  used in host program       */

float MATRIX [SIZE][SIZE] =
   {
    {0.2, 0.3, 0.1, 0.4, 0.5, 0.7, 0.9, 1.0},
    {0.1, 0.2, 0.6, 0.1, 0.2, 0.1, 1.0, 0.5},
    {0.5, 0.2, 0.1, 0.2, 0.5, 0.7, 0.1, 0.9},
    {0.2, 0.3, 0.2, 0.3, 0.8, 0.4, 1.0, 0. },   /*  final digit illegible in the scan  */
    {0.8, 0.6, 0.4, 0.2, 0.5, 0.7, 0.2, 0.1},
    {0.5, 0.2, 0.6, 0.5, 0.9, 0.6, 0.4, 0.6},
    {1.0, 0.1, 0.1, 0.7, 0.8, 0.5, 0.8, 0.9},
    {1.0, 0.3, 0.6, 0.9, 1.2, 0.3, 0.2, 0.1}
   };


Appendix  B.  SLAM  II  Source  Code 

This appendix contains the SLAM II source code and the FORTRAN function
used for message routing. This source listing is set up to simulate 32 nodes
processing 5 bursts. The burst lengths are normally distributed and the number
of receivers is 31 (N - 1). The uniprocessor load is 1000 milliseconds.

GEN,CATHY,HYPERCUBE SIMULATION,3/20/88,10,N,N,,,Y/1,72;
LIMITS,128,20,5000;
EQUIVALENCE/ATRIB(2),SOURCE/
            ATRIB(3),DESTINATION/
            ATRIB(4),NEXT_NODE/
            ATRIB(5),CURRENT_NODE/
            ATRIB(6),AWA_FILE/
            ATRIB(7),PRE_FILE/
            ATRIB(8),RECEIVER/
            ATRIB(9),CHANNEL/
            ATRIB(10),COUNT;
EQUIVALENCE/ATRIB(11),XMT_QUE/
            ATRIB(12),XMT_ACT/
            ATRIB(13),BURST_TIME/
            ATRIB(14),BURSTS_RMNG/
            ATRIB(15),TIME_RMNG/
            ATRIB(16),MEAN_BRST_LEN/
            ATRIB(17),STD_DEV/
            ATRIB(18),LAST_MSG;
EQUIVALENCE/XX(41),NBURSTS/
            XX(42),NPROCS/
            XX(43),XMT_TIME/
            XX(44),OVERHEAD/
            XX(45),MSG_SIZE/
            XX(46),INT_TIME/
            XX(47),MSGS/
            XX(48),CMP_TIME/
            XX(49),MSG_INIT/
            XX(50),NFINISHED/
            XX(51),MSG_COUNT;
PRIORITY/1,LVF(9)/2,LVF(9)/3,LVF(9)/4,LVF(9)/
         5,LVF(9)/6,LVF(9)/7,LVF(9)/8,LVF(9);
PRIORITY/9,LVF(9)/10,LVF(9)/11,LVF(9)/12,LVF(9)/
         13,LVF(9)/14,LVF(9)/15,LVF(9)/16,LVF(9);
PRIORITY/17,LVF(9)/18,LVF(9)/19,LVF(9)/20,LVF(9)/
         21,LVF(9)/22,LVF(9)/23,LVF(9)/24,LVF(9);
PRIORITY/25,LVF(9)/26,LVF(9)/27,LVF(9)/28,LVF(9)/
         29,LVF(9)/30,LVF(9)/31,LVF(9)/32,LVF(9);
ARRAY(1,32)/0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;
;
NETWORK;
      RESOURCE/1,NODE_0(1),33,65;
      RESOURCE/2,NODE_1(1),34,66;
      RESOURCE/3,NODE_2(1),35,67;
      RESOURCE/4,NODE_3(1),36,68;
      RESOURCE/5,NODE_4(1),37,69;
      RESOURCE/6,NODE_5(1),38,70;
      RESOURCE/7,NODE_6(1),39,71;
      RESOURCE/8,NODE_7(1),40,72;
      RESOURCE/9,NODE_8(1),41,73;
      RESOURCE/10,NODE_9(1),42,74;
      RESOURCE/11,NODE_10(1),43,75;
      RESOURCE/12,NODE_11(1),44,76;
      RESOURCE/13,NODE_12(1),45,77;
      RESOURCE/14,NODE_13(1),46,78;
      RESOURCE/15,NODE_14(1),47,79;
      RESOURCE/16,NODE_15(1),48,80;
      RESOURCE/17,NODE_16(1),49,81;
      RESOURCE/18,NODE_17(1),50,82;
      RESOURCE/19,NODE_18(1),51,83;
      RESOURCE/20,NODE_19(1),52,84;
      RESOURCE/21,NODE_20(1),53,85;
      RESOURCE/22,NODE_21(1),54,86;
      RESOURCE/23,NODE_22(1),55,87;
      RESOURCE/24,NODE_23(1),56,88;
      RESOURCE/25,NODE_24(1),57,89;
      RESOURCE/26,NODE_25(1),58,90;
      RESOURCE/27,NODE_26(1),59,91;
      RESOURCE/28,NODE_27(1),60,92;
      RESOURCE/29,NODE_28(1),61,93;
      RESOURCE/30,NODE_29(1),62,94;
      RESOURCE/31,NODE_30(1),63,95;
      RESOURCE/32,NODE_31(1),64,96;

      CREATE,,,1;
      ASSIGN,NBURSTS=5,
             NPROCS=32,
             MSGS=NPROCS-1,
             MSG_SIZE=1024,
             CMP_TIME=1000/NPROCS,
             XMT_TIME=0.0008968224,
             INT_TIME=0.4849987,
OVERHEADS.  6 158445; 

      ASSIGN,NFINISHED=0,
             MSG_INIT=1.667,
             MSG_COUNT=MSGS*NBURSTS,
             BURSTS_RMNG=NBURSTS,
             XX(70)=NPROCS+1;
;

;  PARTITION PROCESS OVER NODES
;
CONT  GOON,6;
      ACT,,NPROCS.GE.1,DIM0;
      ACT,,NPROCS.GE.2,DIM1;
      ACT,,NPROCS.GE.4,DIM2;
      ACT,,NPROCS.GE.8,DIM3;
      ACT,,NPROCS.GE.16,DIM4;
      ACT,,NPROCS.GE.32,DIM5;
DIM0  GOON;
      ACT,,,N0;NODE_0
DIM1  GOON;
      ACT,,,N1;NODE_1
DIM2  GOON;
      ACT,,,N2;NODE_2
      ACT,,,N3;NODE_3
DIM3  GOON;
      ACT,,,N4;NODE_4
      ACT,,,N5;NODE_5
      ACT,,,N6;NODE_6
      ACT,,,N7;NODE_7
DIM4  GOON;
      ACT,,,N8;NODE_8
      ACT,,,N9;NODE_9
      ACT,,,N10;NODE_10
      ACT,,,N11;NODE_11
      ACT,,,N12;NODE_12
      ACT,,,N13;NODE_13
      ACT,,,N14;NODE_14
      ACT,,,N15;NODE_15
DIM5  GOON;
      ACT,,,N16;NODE_16
      ACT,,,N17;NODE_17
      ACT,,,N18;NODE_18
      ACT,,,N19;NODE_19
      ACT,,,N20;NODE_20
      ACT,,,N21;NODE_21
      ACT,,,N22;NODE_22
      ACT,,,N23;NODE_23
      ACT,,,N24;NODE_24
      ACT,,,N25;NODE_25
      ACT,,,N26;NODE_26
      ACT,,,N27;NODE_27
      ACT,,,N28;NODE_28
      ACT,,,N29;NODE_29
      ACT,,,N30;NODE_30
      ACT,,,N31;NODE_31


N0    ASSIGN,SOURCE=1,
             CURRENT_NODE=1;
      ACT,,,COMP;
;
N1    ASSIGN,SOURCE=2,
             CURRENT_NODE=2;
      ACT,,,COMP;
;
N2    ASSIGN,SOURCE=3,
             CURRENT_NODE=3;
      ACT,,,COMP;
;
N3    ASSIGN,SOURCE=4,
             CURRENT_NODE=4;
      ACT,,,COMP;
;
N4    ASSIGN,SOURCE=5,
             CURRENT_NODE=5;
      ACT,,,COMP;
;
N5    ASSIGN,SOURCE=6,
             CURRENT_NODE=6;
      ACT,,,COMP;
;
N6    ASSIGN,SOURCE=7,
             CURRENT_NODE=7;
      ACT,,,COMP;
;
N7    ASSIGN,SOURCE=8,
             CURRENT_NODE=8;
      ACT,,,COMP;
;
N8    ASSIGN,SOURCE=9,
             CURRENT_NODE=9;
      ACT,,,COMP;
;
N9    ASSIGN,SOURCE=10,
             CURRENT_NODE=10;
      ACT,,,COMP;
;
N10   ASSIGN,SOURCE=11,
             CURRENT_NODE=11;
      ACT,,,COMP;
;
N11   ASSIGN,SOURCE=12,
             CURRENT_NODE=12;
      ACT,,,COMP;
N12   ASSIGN,SOURCE=13,
             CURRENT_NODE=13;
      ACT,,,COMP;
N13   ASSIGN,SOURCE=14,
             CURRENT_NODE=14;
      ACT,,,COMP;
N14   ASSIGN,SOURCE=15,
             CURRENT_NODE=15;
      ACT,,,COMP;
N15   ASSIGN,SOURCE=16,
             CURRENT_NODE=16;
      ACT,,,COMP;
N16   ASSIGN,SOURCE=17,
             CURRENT_NODE=17;
      ACT,,,COMP;
N17   ASSIGN,SOURCE=18,
             CURRENT_NODE=18;
      ACT,,,COMP;
N18   ASSIGN,SOURCE=19,
             CURRENT_NODE=19;
      ACT,,,COMP;
N19   ASSIGN,SOURCE=20,
             CURRENT_NODE=20;
      ACT,,,COMP;
;
N20   ASSIGN,SOURCE=21,
             CURRENT_NODE=21;
      ACT,,,COMP;
;
N21   ASSIGN,SOURCE=22,
             CURRENT_NODE=22;
      ACT,,,COMP;
;
N22   ASSIGN,SOURCE=23,
             CURRENT_NODE=23;
      ACT,,,COMP;
;
N23   ASSIGN,SOURCE=24,
             CURRENT_NODE=24;
      ACT,,,COMP;
;
N24   ASSIGN,SOURCE=25,
             CURRENT_NODE=25;
      ACT,,,COMP;
;
N25   ASSIGN,SOURCE=26,
             CURRENT_NODE=26;
      ACT,,,COMP;
N26   ASSIGN,SOURCE=27,
             CURRENT_NODE=27;
      ACT,,,COMP;
N27   ASSIGN,SOURCE=28,
             CURRENT_NODE=28;
      ACT,,,COMP;
N28   ASSIGN,SOURCE=29,
             CURRENT_NODE=29;
      ACT,,,COMP;
N29   ASSIGN,SOURCE=30,
             CURRENT_NODE=30;
      ACT,,,COMP;
N30   ASSIGN,SOURCE=31,
             CURRENT_NODE=31;
      ACT,,,COMP;
N31   ASSIGN,SOURCE=32,
             CURRENT_NODE=32;
      ACT,,,COMP;


;
;  WAIT FOR PROCESSOR, COMPUTE AND SEND MESSAGES TO ALL OTHERS
;
COMP  ASSIGN,AWA_FILE=SOURCE+64,
             XMT_QUE=SOURCE+96,
             XMT_ACT=SOURCE+32,
             STD_DEV=CMP_TIME*0.066667,
             TIME_RMNG=RNORM(CMP_TIME,STD_DEV,1);
BRST  AWAIT(AWA_FILE=65,96),SOURCE/1;
;
;  EXECUTE A CPU BURST
;
      ASSIGN,COUNT=0,
             RECEIVER=1,1;
      ACT,,BURSTS_RMNG.EQ.1,LSTB;
      ACT,,BURSTS_RMNG.GT.1,BTM;
LSTB  ASSIGN,BURST_TIME=TIME_RMNG;
      ACT,,,GA;
BTM   ASSIGN,MEAN_BRST_LEN=TIME_RMNG/BURSTS_RMNG,
             STD_DEV=MEAN_BRST_LEN*0.06667,
             BURST_TIME=RNORM(MEAN_BRST_LEN,STD_DEV,2),1;
      ACT,,BURST_TIME.GE.TIME_RMNG,BTM;
      ACT,,BURST_TIME.LT.TIME_RMNG,GA;
GA    GOON;
      ACT,BURST_TIME;
      ASSIGN,TIME_RMNG=TIME_RMNG-BURST_TIME,
             BURSTS_RMNG=BURSTS_RMNG-1,1;
      ACT,,NPROCS.EQ.1,STOP;
      ACT,,NPROCS.GT.1,CNT;
CNT   ASSIGN,COUNT=COUNT+1,
             LAST_MSG=USERF(3);
DES   ASSIGN,DESTINATION=RECEIVER,
             RECEIVER=RECEIVER+1,1;
      ACT,,SOURCE.EQ.DESTINATION,DES;
XQ    QUEUE(XMT_QUE=97,128);
      ACT(1)/XMT_ACT=33,64,MSG_INIT;
      GOON;
      ACT,,LAST_MSG.EQ.0,OK;
      ACT,,LAST_MSG.EQ.0,CNT;
      ACT,,LAST_MSG.EQ.1,REL;
REL   FREE,SOURCE/1;
      ACT,,,OK;
      ACT,,BURSTS_RMNG.GT.0,BRST;
;
;  GET NEXT NODE AND CHANNEL NUMBER
;
OK    ASSIGN,NEXT_NODE=USERF(1);
;
;  PASS MESSAGE THROUGH THE NETWORK
;
XFER  QUEUE(CURRENT_NODE=1,32);
      ACT(1)/CURRENT_NODE=1,32,MSG_SIZE*XMT_TIME;
      ASSIGN,PRE_FILE=NEXT_NODE+32;
      PREEMPT(PRE_FILE=33,64),NEXT_NODE;
      ACT,OVERHEAD,NEXT_NODE.EQ.DESTINATION,QUIT;
      ACT,INT_TIME,NEXT_NODE.NE.DESTINATION;
      FREE,NEXT_NODE;
      ASSIGN,CURRENT_NODE=NEXT_NODE,
             NEXT_NODE=USERF(1);
      ACT,,,XFER;
;
;  ENSURE NODES RECEIVE ALL THEIR MESSAGES
;
QUIT  FREE,NEXT_NODE;
      ASSIGN,ARRAY(1,SOURCE)=ARRAY(1,SOURCE)+1,1;
      ACT,,ARRAY(1,SOURCE).LT.MSG_COUNT,KILL;
      ACT,,ARRAY(1,SOURCE).EQ.MSG_COUNT,STOP;
;
STOP  ASSIGN,ARRAY(1,SOURCE)=0,
             NFINISHED=NFINISHED+1,1;
      ACT,,NFINISHED.LT.NPROCS,KILL;
      ACT,,NFINISHED.EQ.NPROCS,CLCT;
CLCT  COLCT,INT(1),TIME_IN_SYSTEM;
KILL  TERM;
      END;
INIT,0;
FIN;
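
The ASSIGN statements at the top of the network derive several quantities from the number of nodes and the number of bursts. The following is a minimal sketch of that arithmetic in C, assuming the values given in the listing (32 nodes, 5 bursts, a 1000 ms uniprocessor load). The struct `sim_params` and the function `derive_params` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical container mirroring a few of the SLAM II global variables. */
struct sim_params {
    int    nprocs;     /* NPROCS: number of nodes */
    int    nbursts;    /* NBURSTS: CPU bursts per node */
    int    msgs;       /* MSGS: messages sent after each burst (N - 1) */
    int    msg_count;  /* MSG_COUNT: messages a node sends over the whole run */
    double cmp_time;   /* CMP_TIME: mean compute time per node, in ms */
};

/* Reproduces MSGS=NPROCS-1, MSG_COUNT=MSGS*NBURSTS and
   CMP_TIME=load/NPROCS for a balanced workload. */
struct sim_params derive_params(int nprocs, int nbursts, double uniproc_ms)
{
    struct sim_params p;
    p.nprocs    = nprocs;
    p.nbursts   = nbursts;
    p.msgs      = nprocs - 1;
    p.msg_count = p.msgs * nbursts;
    p.cmp_time  = uniproc_ms / nprocs;
    return p;
}
```

For the configuration in the listing this gives 31 messages per burst, 155 messages per node, and a mean compute time of 31.25 ms per node.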


This is the main program used by SLAM II, which includes the function used to find
the next node and channel in the sender/receiver path. The second function is the
calibration function used when the model was validated.

      PROGRAM MAIN
      DIMENSION NSET(1500000)
      INCLUDE 'PARAM.INC'
      COMMON/SCOM1/ATRIB(MATRB), DD(MEQT), DDL(MEQT), DTNOW, II, MFA,
     1MSTOP, NCLNR, NCRDR, NPRNT, NNRUN, NNSET, NTAPE, SS(MEQT),
     2SSL(MEQT), TNEXT, TNOW, XX(MMXXV)
      COMMON QSET(1500000)
      EQUIVALENCE (NSET(1),QSET(1))
      NNSET=1500000
      NCRDR=5
      NPRNT=6
      NTAPE=7
      CALL SLAM
      STOP
C
      END
C
C
      FUNCTION USERF(I)
      COMMON/SCOM1/ATRIB(100),DD(100),DDL(100),DTNOW,II,
     1MFA,MSTOP,NCLNR,NCRDR,NPRNT,NNRUN,NNSET,NTAPE,
     2SS(100),SSL(100),TNEXT,TNOW,XX(100)
C
      INTEGER*2 CURRENT,DESTIN,PATH,NEXT,POS,MASK,DIR
C
C  Function used to find next node and channel based on the
C  exclusive-or operation of the node identification numbers
C  of the source and destination.
C
    1 CURRENT=ATRIB(5)-1
      DESTIN=ATRIB(3)-1
      POS=0
      PATH=IIEOR(CURRENT,DESTIN)
   10 MASK=2**POS
      NEXT=IIAND(PATH,MASK)
      IF (NEXT .EQ. 0) THEN
         POS=POS+1
         GO TO 10
      ELSE
         DIR=IIAND(MASK,CURRENT)
         IF (DIR .EQ. 0) THEN
            USERF=CURRENT+NEXT+1
         ELSE
            USERF=CURRENT-NEXT+1
         END IF
      END IF
      ATRIB(9)=POS+1
      RETURN
C
C  Calibration function used for model validation.
C
    2 IF (XX(42) .GE. 4) THEN
         USERF=5+7.5*XX(42)
      ELSE
         USERF=0
      END IF
      RETURN
C
C  Function used to set a flag indicating whether or not
C  the last message has been sent by a node.
C
    3 IF (ATRIB(10) .EQ. XX(47)) THEN
         USERF=1
      ELSE
         USERF=0
      END IF
      END
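
The routing rule in the first part of USERF can be restated compactly in C. This is a minimal sketch of the same idea, not a translation used in the study: the function names `next_node` and `route_length` are hypothetical, node numbers are 0-based here (the FORTRAN keeps them 1-based in ATRIB), and the check for a message that has already arrived is an added guard that the original does not contain.

```c
#include <assert.h>

/* Next hop from current toward dest on a hypercube, correcting the
   lowest-order differing address bit first. On return *channel holds
   the 1-based dimension used, or 0 if current == dest. */
int next_node(int current, int dest, int *channel)
{
    int path = current ^ dest;   /* address bits that still differ */
    int pos = 0;

    if (path == 0) {             /* already there: guard not in the original */
        *channel = 0;
        return current;
    }
    while ((path & (1 << pos)) == 0)
        pos++;
    *channel = pos + 1;
    /* Flipping the bit is the same as the original's add-or-subtract
       of the mask, depending on whether the bit is clear or set. */
    return current ^ (1 << pos);
}

/* Number of hops on the route from src to dest. */
int route_length(int src, int dest)
{
    int hops = 0;
    int channel;

    while (src != dest) {
        src = next_node(src, dest, &channel);
        hops++;
    }
    return hops;
}
```

A message from node 0 to node 5 travels 0, 1, 5 using channels 1 and 3. The route is fixed by the two addresses alone, which matches the predetermined routing described for the model.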




Appendix  C.  Benchmark  Results 


This  appendix  lists  the  results  of  the  benchmark  program.  All  times  have  been 
rounded  to  integer  values. 


                 Number
Recalculations   of Nodes                            Runs
                                                   1      2      3      4      5

                    1     Computation            200    200    200    200    200
                          Message Processing       0      0      0      0      0
                          Total                  200    200    200    200    200

                    2     Computation            100    100    100    103    100
                          Message Processing       5      5      0      2      5
                          Total                  105    105    100    105    105

                    4     Computation             50     51     50     51     50
                          Message Processing      45     74     45     54     50
                          Total                   95    125     95    105    100

                    8     Computation             25     26     25     26     25
                          Message Processing      70    119    145     69    120
                          Total                   95    145    170     95    145

                   16     Computation             13     13     13     13     14
                          Message Processing     147    147    157    147    181
                          Total                  160    160    170    160    195

                   32     Computation              6      7      6      7      6
                          Message Processing     339    333    339    368    319


315  375 


                 Number
Recalculations   of Nodes                            Runs
                                                   1      2      3      4      5

                    1     Computation            280    280    280    280    280
                          Message Processing       0      0      0      0      0
                          Total                  280    280    280    280    280

                    2     Computation            140    140    140    140    140
                          Message Processing       5      0      5      0      5
                          Total                  145    140    145    140    145

                    4     Computation             70     70     71     70     70
                          Message Processing      40     40     54     40     40
                          Total                  110    110    125    110    110

                    8     Computation             35     35     35     35     35
                          Message Processing     115    115     75     70     70
                          Total                  150    150    110    105    105

                   16     Computation             18     18     18     18     19
                          Message Processing     167    162    152    172    221
                          Total                  185    180    170    190    240

                   32     Computation              9      9      9      9     10
                          Message Processing     311    326    371    316    280
                          Total                  320    335    380    325    290

                 Number
Recalculations   of Nodes                            Runs
                                                   1      2      3      4      5

      10            1     Computation            400    400    400    400    400
                          Message Processing       0      0      0      0      0
                          Total                  400    400    400    400    400

                    2     Computation            200    200    202    200    200
                          Message Processing       5      5      3      0      0
                          Total                  205    205    205    200    200

                    4     Computation            100    101    100    100    100
                          Message Processing      55     49     40     45     40
                          Total                  155    150    140    145    140

                    8     Computation             51     51     50     51     50
                          Message Processing      74     74     70     69     70
                          Total                  125    125    120    120    120

                   16     Computation             26     25     25     25     25
                          Message Processing     149    175    155    175    185
                          Total                  175    200    180    200    210

                   32     Computation             13     13     12     13     13
                          Message Processing     342    302    333    327    372
                          Total                  355    315    345    340    385

                 Number
Recalculations   of Nodes                            Runs
                                                   1      2      3      4      5

      16            1     Computation            640    640    640    640    640
                          Message Processing       0      0      0      0      0
                          Total                  640    640    640    640    640

                    2     Computation            320    322    320    320    322
                          Message Processing       0      3      5      0      3
                          Total                  320    325    325    320    325

                    4     Computation            160    161    160    161    161
                          Message Processing      40     49     50      9     54
                          Total                  200    210    210    170    215

                    8     Computation             80     81     81     81     80
                          Message Processing     120    104     59    104    110
                          Total                  200    185    140    185    190

                   16     Computation             40     40     40     40     40
                          Message Processing     120    140    145    165    190
                          Total                  160    180    185    205    230

                   32     Computation             21     20     20     20     20
                          Message Processing     200    295    355    335    325
                          Total                  320    315    375    355    345

                 Number
Recalculations   of Nodes                            Runs
                                                   1      2      3      4      5

      20            1     Computation            800    800    800    800    800
                          Message Processing       0      0      0      0      0
                          Total                  800    800    800    800    800

                    2     Computation            403    400    403    400    400
                          Message Processing       2      5      2      5      5
                          Total                  405    405    405    405    405

                    4     Computation            203    203    203    200    201
                          Message Processing      12     47     82     50     44
                          Total                  215    250    285    250    245

                    8     Computation            101    101    101    101    100
                          Message Processing      69     74     79     99    110
                          Total                  170    175    180    200    210

                   16     Computation             51     51     51     51     50
                          Message Processing     184    154    164    174    155
                          Total                  235    205    215    225    205

                   32     Computation             25     26     26     25     25
                          Message Processing     305    339    329    340    310
                          Total                  330    365    355    365    335


A  umbf  i 

Recalculations  of  A 7 ode. 


Computation 
Message*  Processin 


Compulation 
Message  Processing 


110 

395 

390 

40C 

IG5 

164 

165 

161 

LG  5 

191 

155 

151 

(  omput  at  ion 
Message  Processing 


1  355 

321 

!  83 

8: 

'  317 

31 

                    1     Computation           2620   2615   2615   2615   2610
                          Message Processing       0      0      0      0      0
                          Total                 2620   2615   2615   2615   2610

                    2     Computation           1308   1305   1308   1308   1305
                          Message Processing       7      5      2      2      5
                          Total                 1315   1310   1310   1310   1310

                    4     Computation            655    651    655    655    654
                          Message Processing      45     51     50     10     46


                    1     Computation           3620   3620   3620   3620   3620
                          Message Processing       0      0      0      0      0
                          Total                 3620   3620   3620   3620   3620

                    2     Computation           1808   1810   1810   1810   1810
                          Message Processing       2      5      0      0      5
                          Total                 1810   1815   1810   1810   1815

                    4     Computation            905    905    905    905    905
                          Message Processing      10     55     40     40     50

Computation 
Message  Processing 


Computation 
Message  Processing 


                 Number
Recalculations   of Nodes                            Runs
                                                   1      2      3      4      5

                    1     Computation           8010   8010   8010   8010   8010
                          Message Processing       0      0      0      0      0
                          Total                 8010   8010   8010   8010   8010

                    2     Computation           4005   4005   4005   4005   4002
                          Message Processing       5      5      5      5      3
                          Total                 4010   4010   4010   4010   4005

                    4     Computation           2005   2005   2005   2005   2005
                          Message Processing      50     50     50
                          Total                 2055   2055   2055   2055   2060

                    8     Computation           1005   1004   1004   1004   1003
                          Message Processing      95     96     96     56    112
                          Total                 1100   1100   1100   1060   1115

                   16     Computation            505    505    505    505    505
                          Message Processing     115    160    190    155    150

C  omputation 
Message  Processin 


Appendix  D.  Model  Validation 


This appendix lists the simulation results obtained using the four different
uniprocessor loads that are listed in Table 3. The calibration function was
incorporated into the simulation for the purpose of validation. The results from
the simulation and the benchmark are averaged across each uniprocessor load and
cube size. A paired-difference t test was conducted using the average times.


Appendix  E.  Simulation  Results 


This appendix lists the total execution times obtained from the simulator for
each experimental unit listed in Table 5.

This study investigated the relationship between workload
characteristics and process speedup. There were two goals:
the first was to determine the functional relationship between
workload characteristics and speedup, and the second was to
show how simulation could be used to determine such a
relationship. The hypercube implementation used in this study
is a packet-switched network with predetermined routing.
Message processing has precedence, so nodes are interrupted
during task processing.

In this study, three independent variables were controlled:
total computational workload, number of nodes, and the message
traffic load. The workload was assumed to be balanced across
the nodes. A benchmark program was executed on an actual
hypercube and the results were used to validate a discrete
event simulation model of hypercube processing. Using the
simulation, an experiment was designed to control the total
computational load over two levels, the number of nodes over
five levels, and the message traffic load over four levels to
determine their individual and interactive effects on process
speedup.

Regression analysis was used to estimate the functional
relationship between the three independent variables and
process speedup. The results show that a complex relationship
exists between workload characteristics and cube size. As more
nodes are added, the computational time decreases, but at the
same time the communications overhead increases, such that the
speedup will eventually begin to decrease. The point where
speedup starts to decline depends on both the computational
and the message traffic workload. Finally, this research
presented an alternative methodology for performance analysis
which is more flexible than the traditional methods.
Furthermore, this methodology can be extended to study other
architectures.
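
The trade-off between shrinking computation time and growing communications overhead can be made concrete with a small calculation. The following is a minimal sketch, not code from the study: the function names `speedup` and `best_index` are hypothetical, and speedup is taken to be the uniprocessor time divided by the n-node time, which is the usual definition but is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of an n-node run relative to the uniprocessor run.
   Assumes the usual definition T(1) / T(n), both in the same units. */
double speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Index of the entry with the smallest total time, which is also the
   entry with the largest speedup. times[0] is expected to hold the
   uniprocessor time. Ties resolve to the earliest index. */
size_t best_index(const double *times, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (times[i] < times[best])
            best = i;
    return best;
}
```

For illustrative totals of 200, 105, 95, 95 and 160 ms on 1, 2, 4, 8 and 16 nodes, `best_index` selects the 4-node entry, where the speedup is about 2.1. At 16 nodes the speedup falls back to 1.25 because message processing outweighs the shorter computation.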


